Abstract

In this paper, we consider the problem of link prediction in time-evolving graphs.
We assume that certain graph features, such as the node degree, follow a vector
autoregressive (VAR) model and we propose to use this information to improve
the accuracy of prediction. Our strategy involves a joint optimization procedure
over the space of adjacency matrices and VAR matrices which takes into account
both sparsity and low rank properties of the matrices. Oracle inequalities are
derived and illustrate the trade-offs in the choice of smoothing parameters when
modeling the joint effect of sparsity and low rank property. The estimate is
computed efficiently using proximal methods through a generalized forward-backward
algorithm.

1 Introduction 

Forecasting systems behavior with multiple responses has been a challenging issue in many contexts
of applications such as collaborative filtering, financial markets, or bioinformatics, where responses
can be, respectively, movie ratings, stock prices, or activity of genes within a cell. Statistical
modeling techniques have been widely investigated in the context of multivariate time series either in the
multiple linear regression setup [4] or with autoregressive models [25]. More recently, kernel-based
regularized methods have been developed for multitask learning [7, 2]. These approaches share the
use of the correlation structure among input variables to enrich the prediction on every single output.
Often, the correlation structure is assumed to be given or it is estimated separately. A discrete
encoding of correlations between variables can be modeled as a graph so that learning the dependence
structure amounts to performing graph inference through the discovery of uncovered edges on the
graph. The latter problem is interesting per se and it is known as the problem of link prediction
where it is assumed that only a part of the graph is actually observed [16, 9]. This situation occurs
in various applications such as recommender systems, social networks, or proteomics, and the
appropriate tools can be found among matrix completion techniques [22, 5, 1]. In the realistic setup
of a time-evolving graph, matrix completion was also used and adapted to take into account the
dynamics of the features of the graph [19]. In this paper, we study the prediction problem where the
observation is a sequence of graph adjacency matrices $(A_t)_{0 \le t \le T}$ and the goal is to predict $A_{T+1}$.
This type of problem arises in applications such as recommender systems where, given information
on purchases made by some users, one would like to predict future purchases. In this context,
users and products can be modeled as the nodes of a bipartite graph, while purchases or clicks are
modeled as edges. In functional genomics and systems biology, estimating regulatory networks in
gene expression can be performed by modeling the data as graphs, and fitting predictive models is
a natural way of estimating evolving networks in these contexts. A large variety of methods for
link prediction only consider predicting from a single static snapshot of the graph; this includes
heuristics [16, 21], matrix factorization [13], diffusion [17], or probabilistic methods [23]. More
recently, some works have investigated using sequences of observations of the graph to improve the
prediction, such as using regression on features extracted from the graphs [19], matrix
factorization [14], or continuous-time regression [27]. Our main assumption is that the network effect is a
cause and a symptom at the same time, and therefore, the edges and the graph features should be
estimated simultaneously. We propose a regularized approach to predict the uncovered links and the
evolution of the graph features simultaneously. We provide oracle bounds under the assumption that
the noise sequence has subgaussian tails and we prove that our procedure achieves a trade-off in the
calibration of smoothing parameters which adjust with the sparsity and the rank of the unknown
adjacency matrix. The rest of this paper is organized as follows. In Section 2, we describe the general
setup of our work with the main assumptions and we formulate a regularized optimization problem
which aims at jointly estimating the autoregression parameters and predicting the graph. In Section
3, we provide technical results with oracle inequalities and other theoretical guarantees on the joint
estimation-prediction. Section 4 is devoted to the description of the numerical simulations which
illustrate our approach. We also provide an efficient algorithm for solving the optimization
problem and show empirical results. The proofs of the theoretical results are provided as supplementary
material in a separate document.

2 Estimation of low-rank graphs with autoregressive features 

Our approach is based on the assumption that features can explain most of the information contained
in the graph, and that these features evolve with time. We make the following assumptions
about the sequence $(A_t)_{t \ge 0}$ of adjacency matrices of the sequence of graphs.

Low-Rank. We assume that the matrices A t have low-rank. This reflects the presence of highly 
connected groups of nodes such as communities in social networks, or product categories and groups 
of loyal/fanatic users in a market place data, and is sometimes motivated by the small number of 
factors that explain nodes interactions. 

Autoregressive linear features. We assume to be given a linear map $\omega : \mathbb{R}^{n \times n} \to \mathbb{R}^d$ defined by

$$\omega(A) = \big(\langle \Omega_1, A \rangle, \ldots, \langle \Omega_d, A \rangle\big), \tag{1}$$

where $(\Omega_i)_{1 \le i \le d}$ is a set of $n \times n$ matrices. These matrices can be either deterministic or random in
our theoretical analysis, but we take them deterministic for the sake of simplicity. The vector time
series $(\omega(A_t))_{t \ge 0}$ has autoregressive dynamics, given by a VAR (Vector Auto-Regressive) model:

$$\omega(A_{t+1}) = W_0^\top \omega(A_t) + N_{t+1},$$

where $W_0 \in \mathbb{R}^{d \times d}$ is an unknown sparse matrix and $(N_t)_{t \ge 0}$ is a sequence of noise vectors in $\mathbb{R}^d$.
An example of linear features is the degree (i.e. number of edges connected to each node, or the sum 
of their weights if the edges are weighted), which is a measure of popularity in social and commerce 
networks. Introducing 

$$X_{T-1} = \big(\omega(A_0), \ldots, \omega(A_{T-1})\big)^\top \quad \text{and} \quad X_T = \big(\omega(A_1), \ldots, \omega(A_T)\big)^\top,$$

which are both $T \times d$ matrices, we can write this model in a matrix form:

$$X_T = X_{T-1} W_0 + N_T, \tag{2}$$

where $N_T = (N_1, \ldots, N_T)^\top$.

This assumes that the noise is driven by time-series dynamics (a martingale increment), whose
coordinates are independent (meaning that features are independently corrupted by noise), with a
sub-gaussian tail and variance uniformly bounded by a constant $\sigma^2$. In particular, no independence
assumption between the $N_t$ is required here.
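As an illustration of the matrix form (2), the following minimal sketch stacks degree features into the two design matrices and fits the VAR matrix by plain least squares. The function names (`degree_features`, `design_matrices`, `least_squares_var`) and the noise-free toy sequence are assumptions made for this example only; the unpenalized fit is a stand-in, not the penalized estimator studied in this paper.

```python
import numpy as np

def degree_features(A):
    """Linear feature map omega(A): row sums of A, i.e. node degrees (so d = n)."""
    return np.asarray(A, dtype=float).sum(axis=1)

def design_matrices(graphs, feature_map=degree_features):
    """Return (X_prev, X_next) built from the snapshots A_0, ..., A_T.

    Row t of X_prev is omega(A_t) for t = 0..T-1 and row t of X_next is
    omega(A_{t+1}), so both matrices have shape (T, d).
    """
    feats = np.stack([feature_map(A) for A in graphs])
    return feats[:-1], feats[1:]

def least_squares_var(X_prev, X_next):
    """Unpenalized least-squares estimate of W_0 in X_next = X_prev @ W_0 + noise."""
    W_hat, *_ = np.linalg.lstsq(X_prev, X_next, rcond=None)
    return W_hat

# Hypothetical noise-free toy sequence: diagonal graphs whose degrees
# follow x_{t+1} = W_0^T x_t exactly.
W0 = np.array([[0.5, 0.2, 0.0],
               [0.0, 0.4, 0.3],
               [0.1, 0.0, 0.6]])
x = np.array([1.0, 2.0, 3.0])
graphs = []
for _ in range(7):  # A_0, ..., A_6, hence T = 6
    graphs.append(np.diag(x))
    x = W0.T @ x

X_prev, X_next = design_matrices(graphs)
W_hat = least_squares_var(X_prev, X_next)
```

In this noise-free setting the rows satisfy `X_next = X_prev @ W0`, so least squares recovers `W0`; with noise and few snapshots, a sparsity-inducing penalty becomes relevant.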

Notations. The notations $\|\cdot\|_F$, $\|\cdot\|_p$, $\|\cdot\|_\infty$, $\|\cdot\|_*$ and $\|\cdot\|_{\mathrm{op}}$ stand, respectively, for the Frobenius
norm, entry-wise $\ell_p$ norm, entry-wise $\ell_\infty$ norm, trace-norm (or nuclear norm, given by the sum of the
singular values) and operator norm (the largest singular value). We denote by $\langle A, B \rangle = \mathrm{tr}(A^\top B)$
the Euclidean matrix product. A vector in $\mathbb{R}^d$ is always understood as a $d \times 1$ matrix. We denote
by $\|A\|_0$ the number of non-zero elements of $A$. The product $A \circ B$ between two matrices with
matching dimensions stands for the Hadamard or entry-wise product between $A$ and $B$. The matrix
$|A|$ contains the absolute values of entries of $A$. The matrix $(M)_+$ is the componentwise positive part
of the matrix $M$, and $\mathrm{sign}(M)$ is the sign matrix associated to $M$ with the convention $\mathrm{sign}(0) = 0$.

If $A$ is an $n \times n$ matrix with rank $r$, we write its SVD as $A = U \Sigma V^\top = \sum_{j=1}^r \sigma_j u_j v_j^\top$ where
$\Sigma = \mathrm{diag}(\sigma_1, \ldots, \sigma_r)$ is an $r \times r$ diagonal matrix containing the non-zero singular values of $A$ in
decreasing order, and $U = [u_1, \ldots, u_r]$, $V = [v_1, \ldots, v_r]$ are $n \times r$ matrices with columns given by
the left and right singular vectors of $A$. The projection matrix onto the space spanned by the columns
(resp. rows) of $A$ is given by $P_U = U U^\top$ (resp. $P_V = V V^\top$). The operator $\mathcal{P}_A : \mathbb{R}^{n \times n} \to \mathbb{R}^{n \times n}$
given by $\mathcal{P}_A(B) = P_U B + B P_V - P_U B P_V$ is the projector onto the linear space spanned by the
matrices $u_k x^\top$ and $y v_j^\top$ for $1 \le j, k \le r$ and $x, y \in \mathbb{R}^n$. The projector onto the orthogonal space is
given by $\mathcal{P}_A^\perp(B) = (I - P_U) B (I - P_V)$. We also use the notation $a \vee b = \max(a, b)$.
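The norms and projectors introduced in the notations can be checked numerically. The sketch below is only an illustration under our own naming (`matrix_norms`, `projectors` and the tolerance `tol` are hypothetical choices, not part of the paper); it computes each norm from the singular values or the entries and builds the projector onto the row and column spaces together with its orthogonal complement.

```python
import numpy as np

def matrix_norms(A):
    """Norms used in the text, computed for a dense matrix A."""
    s = np.linalg.svd(A, compute_uv=False)
    return {
        "frobenius": np.sqrt((A ** 2).sum()),
        "l1": np.abs(A).sum(),            # entry-wise l1 norm
        "linf": np.abs(A).max(),          # entry-wise l-infinity norm
        "trace": s.sum(),                 # sum of singular values
        "operator": s.max(),              # largest singular value
        "l0": int(np.count_nonzero(A)),   # number of non-zero entries
    }

def projectors(A, tol=1e-10):
    """Return (P_A, P_A_perp) as functions, following the definitions in the text."""
    U, s, Vt = np.linalg.svd(A)
    r = int((s > tol).sum())              # numerical rank (tol is an assumption)
    P_U = U[:, :r] @ U[:, :r].T
    P_V = Vt[:r].T @ Vt[:r]
    I = np.eye(A.shape[0])

    def P_A(B):
        return P_U @ B + B @ P_V - P_U @ B @ P_V

    def P_A_perp(B):
        return (I - P_U) @ B @ (I - P_V)

    return P_A, P_A_perp

# Rank-one example: A = u v^T.
A = np.outer([1.0, 2.0, 0.0], [0.0, 1.0, 1.0])
norms = matrix_norms(A)
P_A, P_A_perp = projectors(A)
B = np.arange(9, dtype=float).reshape(3, 3)
```

For any `B`, the two projections add up to `B`, and `A` itself is left unchanged by `P_A`.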

2.1 Joint prediction-estimation through penalized optimization 

In order to reflect the autoregressive dynamics of the features, we use a least-squares
goodness-of-fit criterion that encourages the similarity between two feature vectors at successive time steps. In
order to induce sparsity in the estimator of $W_0$, we penalize this criterion using the $\ell_1$ norm. This
leads to the following penalized objective function:

$$J_1(W) = \frac{1}{dT} \|X_T - X_{T-1} W\|_F^2 + \kappa \|W\|_1,$$

where $\kappa > 0$ is a smoothing parameter.

Now, for the prediction of $A_{T+1}$, we propose to minimize a least-squares criterion penalized by the
combination of an $\ell_1$ norm and a trace-norm. This mixture of norms induces sparsity and a low rank
of the adjacency matrix. Such a combination of $\ell_1$ and trace-norm was already studied in [8] for the
matrix regression model, and in [20] for the prediction of an adjacency matrix.

The objective function defined below exploits the fact that if $W$ is close to $W_0$, then the features of
the next graph $\omega(A_{T+1})$ should be close to $W^\top \omega(A_T)$. Therefore, we consider

$$J_2(A, W) = \frac{1}{d} \|\omega(A) - W^\top \omega(A_T)\|_F^2 + \tau \|A\|_* + \gamma \|A\|_1,$$

where $\tau, \gamma > 0$ are smoothing parameters. The overall objective function is the sum of the two
partial objectives $J_1$ and $J_2$, which is jointly convex with respect to $A$ and $W$:

$$\mathcal{L}(A, W) = \frac{1}{dT} \|X_T - X_{T-1} W\|_F^2 + \kappa \|W\|_1 + \frac{1}{d} \|\omega(A) - W^\top \omega(A_T)\|_2^2 + \tau \|A\|_* + \gamma \|A\|_1. \tag{3}$$

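If we choose convex cones $\mathcal{A} \subset \mathbb{R}^{n \times n}$ and $\mathcal{W} \subset \mathbb{R}^{d \times d}$, our joint estimation-prediction procedure is
defined by

$$(\hat{A}, \hat{W}) \in \operatorname*{argmin}_{(A, W) \in \mathcal{A} \times \mathcal{W}} \mathcal{L}(A, W). \tag{4}$$

It is natural to take $\mathcal{W} = \mathbb{R}^{d \times d}$ and $\mathcal{A} = (\mathbb{R}_+)^{n \times n}$ since there is no a priori on the values of the
feature matrix $W_0$, while the entries of the matrix $A_{T+1}$ must be positive.

In the next section we propose oracle inequalities which prove that this procedure can estimate $W_0$
and predict $A_{T+1}$ at the same time.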
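To make the structure of the joint objective concrete, here is a minimal sketch that evaluates its five terms. The normalising constants `1/(d*T)` and `1/d` reflect our reading of the objective and should be treated as an assumption, as should the function name `joint_objective` and its argument layout (`omega` is any callable implementing the linear feature map, `omega_AT` is the feature vector of the last observed graph).

```python
import numpy as np

def joint_objective(A, W, X_prev, X_next, omega, omega_AT, tau, gamma, kappa):
    """Evaluate the penalized joint criterion at (A, W).

    Assumed normalisation: 1/(d*T) on the VAR fit and 1/d on the graph fit.
    """
    T, d = X_prev.shape
    # Least-squares fit of the autoregressive dynamics of the features.
    fit_var = np.sum((X_next - X_prev @ W) ** 2) / (d * T)
    # Features of the candidate graph should match the one-step forecast.
    fit_graph = np.sum((omega(A) - W.T @ omega_AT) ** 2) / d
    # l1 penalty on W, trace-norm and l1 penalties on A.
    penalty = (kappa * np.abs(W).sum()
               + tau * np.linalg.svd(A, compute_uv=False).sum()
               + gamma * np.abs(A).sum())
    return fit_var + fit_graph + penalty

# Tiny worked example with degree features (row sums).
degree = lambda A: A.sum(axis=1)
X_prev = np.ones((2, 2))
X_next = np.ones((2, 2))
value_at_zero = joint_objective(np.zeros((2, 2)), np.zeros((2, 2)), X_prev, X_next,
                                degree, np.ones(2), tau=0.5, gamma=0.25, kappa=1.0)
value_at_identity = joint_objective(np.eye(2), np.zeros((2, 2)), X_prev, X_next,
                                    degree, np.ones(2), tau=0.5, gamma=0.25, kappa=1.0)
```

In the second call the VAR fit contributes 1, the graph fit 1, the trace norm 0.5 × 2 and the l1 norm 0.25 × 2, which shows how the two penalties on `A` add up.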

2.2 Main result 

The central contribution of our work is to bound the prediction error with high probability under the 
following natural hypothesis on the noise process. 

Assumption 1. We assume that $(N_t)_{t \ge 0}$ satisfies $\mathbb{E}[N_t \mid \mathcal{F}_{t-1}] = 0$ for any $t \ge 1$ and that there is
$\sigma > 0$ such that for any $\lambda \in \mathbb{R}$, $j = 1, \ldots, d$ and $t \ge 0$:

$$\mathbb{E}\big[e^{\lambda (N_t)_j} \mid \mathcal{F}_{t-1}\big] \le e^{\sigma^2 \lambda^2 / 2}.$$

Moreover, we assume that for each $t \ge 0$, the coordinates $(N_t)_1, \ldots, (N_t)_d$ are independent.

The main result can be summarized as follows. The prediction error and the estimation error can be
simultaneously bounded by the sum of three terms that involve homogeneously (a) the sparsity, (b)
the rank of the adjacency matrix $A_{T+1}$, and (c) the sparsity of the VAR model matrix $W_0$. The tight
bounds we obtain are similar to the bounds of the Lasso.
The positive constants $C_1, C_2, C_3$ appearing in the upper bound are proportional to the noise level $\sigma$. The interplay between the
rank and sparsity constraints on $A_{T+1}$ is reflected in the observation that the values of $C_2$ and $C_3$
can be changed as long as their sum remains constant.

3 Oracle inequalities 

In this section we give oracle inequalities for the mixed prediction-estimation error which is given,
for any $A \in \mathbb{R}^{n \times n}$ and $W \in \mathbb{R}^{d \times d}$, by

$$\mathcal{E}(A, W)^2 = \frac{1}{d} \big\|(W - W_0)^\top \omega(A_T) - \omega(A - A_{T+1})\big\|_2^2 + \frac{1}{dT} \big\|X_{T-1}(W - W_0)\big\|_F^2. \tag{5}$$

It is important to have in mind that an upper bound on $\mathcal{E}$ implies upper bounds on each of
its two components. It entails in particular an upper bound on the feature estimation error
$\|X_{T-1}(\hat{W} - W_0)\|_F$ that makes $\|(\hat{W} - W_0)^\top \omega(A_T)\|_2$ smaller and consequently controls the
prediction error over the graph edges through $\|\omega(\hat{A} - A_{T+1})\|_2$.

The upper bounds on $\mathcal{E}$ given below exhibit the dependence of the accuracy of estimation and
prediction on the number of features $d$, the number of nodes $n$ and the number $T$ of observed graphs in
the sequence.

Let us recall $N_T = (N_1, \ldots, N_T)^\top$ and introduce the noise processes

$$M = -\sum_{j=1}^{d} (N_{T+1})_j \, \Omega_j \quad \text{and} \quad \Xi = \sum_{t=1}^{T+1} \omega(A_{t-1}) N_t^\top,$$

which are, respectively, $n \times n$ and $d \times d$ random matrices. The source of randomness comes from
the noise sequence $(N_t)_{t \ge 0}$, see Assumption 1. If these noise processes are controlled correctly, we
can prove the following oracle inequalities for procedure (4). The next result is an oracle inequality
of slow type (see for instance [3]), that holds in full generality.

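Theorem 1. Let $(\hat{A}, \hat{W})$ be given by (4) and suppose that

$$\tau \ge \frac{2\alpha}{d} \|M\|_{\mathrm{op}}, \quad \gamma \ge \frac{2(1 - \alpha)}{d} \|M\|_\infty \quad \text{and} \quad \kappa \ge \frac{2}{dT} \|\Xi\|_\infty \tag{6}$$

for some $\alpha \in (0, 1)$. Then, we have

$$\mathcal{E}(\hat{A}, \hat{W})^2 \le \inf_{(A, W) \in \mathcal{A} \times \mathcal{W}} \Big\{ \mathcal{E}(A, W)^2 + 2\tau \|A\|_* + 2\gamma \|A\|_1 + 2\kappa \|W\|_1 \Big\}.$$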

For the proof of oracle inequalities of fast type, the restricted eigenvalue (RE) condition introduced
in [3] and [10, 11] is of importance. Restricted eigenvalue conditions are implied by, and in
general weaker than, the so-called incoherence or RIP (Restricted Isometry Property, [6]) assumptions,
which exclude, for instance, strong correlations between covariates in a linear regression model.
This condition is acknowledged to be one of the weakest to derive fast rates for the Lasso (see [26]
for a comparison of conditions).

Matrix versions of these assumptions are introduced in [12]. Below is a version of the RE assumption
that fits in our context. First, we need to introduce the two restriction cones.

The first cone is related to the $\|W\|_1$ term used in procedure (4). If $W \in \mathbb{R}^{d \times d}$, we denote by
$\Theta_W = \mathrm{sign}(W) \in \{0, \pm 1\}^{d \times d}$ the signed sparsity pattern of $W$ and by $\Theta_W^\perp \in \{0, 1\}^{d \times d}$ the
orthogonal sparsity pattern. For a fixed matrix $W \in \mathbb{R}^{d \times d}$ and $c > 0$, we introduce the cone

$$\mathcal{C}_1(W, c) = \Big\{ W' \in \mathcal{W} : \|\Theta_W^\perp \circ W'\|_1 \le c \, \|\Theta_W \circ W'\|_1 \Big\}.$$

This cone contains the matrices $W'$ that have their largest entries in the sparsity pattern of $W$.

The second cone is related to the mixture of the terms $\|A\|_*$ and $\|A\|_1$ in procedure (4). For a fixed
$A \in \mathbb{R}^{n \times n}$ and $c, \beta > 0$, we introduce the cone

$$\mathcal{C}_2(A, c, \beta) = \Big\{ A' \in \mathcal{A} : \|\mathcal{P}_A^\perp(A')\|_* + \beta \|\Theta_A^\perp \circ A'\|_1 \le c \big( \|\mathcal{P}_A(A')\|_* + \beta \|\Theta_A \circ A'\|_1 \big) \Big\}.$$

This cone consists of the matrices $A'$ with large entries close to those of $A$ and that are "almost aligned"
with the row and column spaces of $A$. The parameter $\beta$ quantifies the interplay between these two
notions.
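Membership in the first cone, the one attached to the $\ell_1$ penalty on $W$, is easy to test numerically: it compares the $\ell_1$ mass of a candidate matrix outside the support of $W$ with its mass on the support. The sketch below is only an illustration; the function name `in_cone_C1` and the toy matrices are our own.

```python
import numpy as np

def in_cone_C1(W, W_prime, c):
    """Check whether W_prime lies in the cone C_1(W, c).

    The condition is: l1 mass of W_prime off the support of W
    is at most c times its l1 mass on the support of W.
    """
    support = W != 0
    on_support = np.abs(W_prime[support]).sum()
    off_support = np.abs(W_prime[~support]).sum()
    return bool(off_support <= c * on_support)

W = np.array([[1.0, 0.0],
              [0.0, -2.0]])
W_prime = np.array([[1.0, 0.5],
                    [0.5, 1.0]])
# Mass on the support is 2, mass off the support is 1.
inside = in_cone_C1(W, W_prime, c=1.0)
outside = in_cone_C1(W, W_prime, c=0.4)
```

Since the entries of the sign pattern are 0 or ±1, multiplying entry-wise by it and taking absolute values amounts to restricting to the support, which is what the boolean mask does here.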

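Definition 1 (Restricted Eigenvalue (RE)). For $W \in \mathcal{W}$ and $c > 0$, we introduce

$$\mu_1(W, c) = \inf \Big\{ \mu > 0 : \|\Theta_W \circ W'\|_F \le \frac{\mu}{\sqrt{dT}} \|X_{T-1} W'\|_F, \ \forall W' \in \mathcal{C}_1(W, c) \Big\}.$$

For $A \in \mathcal{A}$ and $c, \beta > 0$, we introduce

$$\mu_2(A, W, c, \beta) = \inf \Big\{ \mu > 0 : \|\mathcal{P}_A(A')\|_F \vee \|\Theta_A \circ A'\|_F \le \frac{\mu}{\sqrt{d}} \|W'^\top \omega(A_T) - \omega(A')\|_2, \ \forall W' \in \mathcal{C}_1(W, c), \ \forall A' \in \mathcal{C}_2(A, c, \beta) \Big\}.$$

The RE assumption consists of assuming that the constants $\mu_1$ and $\mu_2$ are non-zero. Now we can
state the following Theorem that gives a fast oracle inequality for our procedure using RE.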

Theorem 2. Let $(\hat{A}, \hat{W})$ be given by (4) and suppose that

$$\tau \ge \frac{3\alpha}{d} \|M\|_{\mathrm{op}}, \quad \gamma \ge \frac{3(1 - \alpha)}{d} \|M\|_\infty \quad \text{and} \quad \kappa \ge \frac{3}{dT} \|\Xi\|_\infty \tag{7}$$

for some $\alpha \in (0, 1)$. Then, we have

$$\mathcal{E}(\hat{A}, \hat{W})^2 \le \inf_{(A, W) \in \mathcal{A} \times \mathcal{W}} \Big\{ \mathcal{E}(A, W)^2 + \frac{25}{18} \mu_2(A, W)^2 \big( \mathrm{rank}(A) \tau^2 + \|A\|_0 \gamma^2 \big) + \frac{25}{36} \mu_1(W)^2 \|W\|_0 \kappa^2 \Big\},$$

where $\mu_1(W) = \mu_1(W, 5)$ and $\mu_2(A, W) = \mu_2(A, W, 5, \gamma / \tau)$ (see Definition 1).

The proofs of Theorems 1 and 2 use tools introduced in [12] and [3].

Note that the residual term from this oracle inequality mixes the notions of sparsity of $A$ and $W$
via the terms $\mathrm{rank}(A)$, $\|A\|_0$ and $\|W\|_0$. It says that our mixed penalization procedure provides an
optimal trade-off between fitting the data and complexity, measured by both sparsity and low rank.
This is the first result of this nature to be found in the literature.

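In the next Theorem 3, we obtain convergence rates for the procedure (4) by combining Theorem 2
with controls on the noise processes. We introduce

$$v_{\Omega, \mathrm{op}}^2 = \Big\| \frac{1}{d} \sum_{j=1}^{d} \Omega_j^\top \Omega_j \Big\|_{\mathrm{op}} \vee \Big\| \frac{1}{d} \sum_{j=1}^{d} \Omega_j \Omega_j^\top \Big\|_{\mathrm{op}}, \quad v_{\Omega, \infty}^2 = \Big\| \frac{1}{d} \sum_{j=1}^{d} \Omega_j \circ \Omega_j \Big\|_\infty,$$

$$\sigma_\omega^2 = \max_{j = 1, \ldots, d} \frac{1}{T + 1} \sum_{t=1}^{T+1} \omega_j(A_{t-1})^2,$$

which are the (observable) variance terms that naturally appear in the controls of the noise processes.
We introduce also

$$\ell_T = 2 \max_{j = 1, \ldots, d} \log \log \Big( \frac{\sum_{t=1}^{T+1} \omega_j(A_{t-1})^2}{T + 1} \vee \frac{T + 1}{\sum_{t=1}^{T+1} \omega_j(A_{t-1})^2} \vee e \Big),$$

which is a small (observable) technical term that comes out of our analysis of the noise process $\Xi$.
This term is a small price to pay for the fact that no independence assumption is required on the
noise sequence $(N_t)_{t \ge 0}$, but only a martingale increment structure with sub-gaussian tails.

Theorem 3. Consider the procedure $(\hat{A}, \hat{W})$ given by (4) with smoothing parameters given by

$$\tau = 3 \alpha \sigma v_{\Omega, \mathrm{op}} \sqrt{\frac{2(x + \log(2n))}{d}}, \quad \gamma = 3 (1 - \alpha) \sigma v_{\Omega, \infty} \sqrt{\frac{2(x + 2 \log n)}{d}},$$

$$\kappa = 6 \sigma \sigma_\omega \frac{1}{d} \sqrt{\frac{2e(x + 2 \log d + \ell_T)}{T + 1}}$$

for some $\alpha \in (0, 1)$ and fix a confidence level $x > 0$. Then, we have

$$\mathcal{E}(\hat{A}, \hat{W})^2 \le \inf_{(A, W) \in \mathcal{A} \times \mathcal{W}} \Big\{ \mathcal{E}(A, W)^2 + 25 \mu_2(A, W)^2 \, \mathrm{rank}(A) \, \alpha^2 \sigma^2 v_{\Omega, \mathrm{op}}^2 \frac{2(x + \log(2n))}{d}$$
$$\qquad + 25 \mu_2(A, W)^2 \|A\|_0 (1 - \alpha)^2 \sigma^2 v_{\Omega, \infty}^2 \frac{2(x + 2 \log n)}{d} + 25 \mu_1(W)^2 \|W\|_0 \sigma^2 \sigma_\omega^2 \frac{2e(x + 2 \log d + \ell_T)}{d^2 (T + 1)} \Big\}$$

with a probability larger than $1 - 17 e^{-x}$, where $\mu_1$ and $\mu_2$ are the same as in Theorem 2.

The proof of Theorem 3 follows directly from Theorem 2 and basic noise control results. In the next
Theorem, we propose more explicit upper bounds for both the individual estimation of $W_0$ and the
prediction of $A_{T+1}$.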

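Theorem 4. Under the same assumptions as in Theorem 3, for any $x > 0$ the following inequalities
hold with a probability larger than $1 - 17 e^{-x}$:

$$\frac{1}{dT} \|X_{T-1}(\hat{W} - W_0)\|_F^2 \le \inf_{A \in \mathcal{A}} \Big\{ \frac{1}{d} \|\omega(A) - \omega(A_{T+1})\|_F^2 + \frac{25}{18} \mu_2(A, W_0)^2 \big( \mathrm{rank}(A) \tau^2 + \|A\|_0 \gamma^2 \big) \Big\} + \frac{25}{36} \mu_1(W_0)^2 \|W_0\|_0 \kappa^2, \tag{8}$$

$$\|\hat{W} - W_0\|_1 \le 5 \mu_1(W_0)^2 \|W_0\|_0 \kappa + 6 \sqrt{\|W_0\|_0} \, \mu_1(W_0) \inf_{A \in \mathcal{A}} \sqrt{ \frac{1}{d} \|\omega(A) - \omega(A_{T+1})\|_F^2 + \frac{25}{18} \mu_2(A, W_0)^2 \big( \mathrm{rank}(A) \tau^2 + \|A\|_0 \gamma^2 \big) }, \tag{9}$$

$$\|\hat{A} - A_{T+1}\|_* \le 5 \mu_1(W_0)^2 \|W_0\|_0 \kappa + \Big( 6 \sqrt{\mathrm{rank} A_{T+1}} + 5 \frac{\gamma}{\tau} \sqrt{\|A_{T+1}\|_0} \Big) \mu_2(A_{T+1})$$
$$\qquad \times \inf_{A \in \mathcal{A}} \sqrt{ \frac{1}{d} \|\omega(A) - \omega(A_{T+1})\|_F^2 + \frac{25}{18} \mu_2(A, W_0)^2 \big( \mathrm{rank}(A) \tau^2 + \|A\|_0 \gamma^2 \big) }. \tag{10}$$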

4 Algorithms and Numerical Experiments 

4.1 Generalized forward-backward algorithm for minimizing C 

We use the algorithm designed in [18] for minimizing our objective function. Note that this
algorithm is preferable to the method introduced in [19] as it directly minimizes $\mathcal{L}$ jointly in $(A, W)$
rather than alternately minimizing in $W$ and $A$.

Moreover we use the novel joint penalty from [20] that is more suited for estimating
graphs. The proximal operator for the trace norm is given by the shrinkage operation: if
$Z = U \mathrm{diag}(\sigma_1, \ldots, \sigma_n) V^\top$ is the singular value decomposition of $Z$,

$$\mathrm{prox}_{\tau \|\cdot\|_*}(Z) = U \mathrm{diag}\big((\sigma_i - \tau)_+\big)_i V^\top.$$

Similarly, the proximal operator for the $\ell_1$-norm is the soft thresholding operator defined by using
the entry-wise product of matrices denoted by $\circ$:

$$\mathrm{prox}_{\gamma \|\cdot\|_1}(Z) = \mathrm{sgn}(Z) \circ (|Z| - \gamma)_+.$$

The algorithm converges under very mild conditions when the step size $\theta$ is smaller than $\frac{2}{L}$, where
$L$ is the operator norm of the joint quadratic loss:

$$\Phi : (A, W) \mapsto \frac{1}{dT} \|X_T - X_{T-1} W\|_F^2 + \frac{1}{d} \|\omega(A) - W^\top \omega(A_T)\|_F^2.$$
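The two proximal operators described above (singular value shrinkage for the trace norm, entry-wise soft thresholding for the $\ell_1$ norm) can be sketched in a few lines of NumPy. The function names `prox_trace` and `prox_l1` are our own, and the code is an illustration rather than the authors' implementation.

```python
import numpy as np

def prox_trace(Z, tau):
    """Proximal operator of tau * trace norm: shrink the singular values by tau."""
    U, s, Vt = np.linalg.svd(Z, full_matrices=False)
    # (U * s_shrunk) scales the columns of U, i.e. U @ diag(s_shrunk).
    return (U * np.maximum(s - tau, 0.0)) @ Vt

def prox_l1(Z, gamma):
    """Proximal operator of gamma * l1 norm: entry-wise soft thresholding."""
    return np.sign(Z) * np.maximum(np.abs(Z) - gamma, 0.0)

soft = prox_l1(np.array([[3.0, -1.0], [0.5, -2.0]]), 1.0)
shrunk = prox_trace(np.diag([3.0, 1.0]), 2.0)
```

Entries with magnitude at most `gamma` are set to zero and singular values at most `tau` are removed, which is how the two penalties produce sparse and low-rank iterates respectively.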



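Algorithm 1 Generalized Forward-Backward to Minimize $\mathcal{L}$

    Initialize $A$, $Z_1$, $Z_2$, $W$, $q = 2$
    repeat
        Compute $(G_A, G_W) = \nabla_{A, W} \Phi(A, W)$.
        Compute $Z_1 = \mathrm{prox}_{q \theta \tau \|\cdot\|_*}(2A - Z_1 - \theta G_A)$
        Compute $Z_2 = \mathrm{prox}_{q \theta \gamma \|\cdot\|_1}(2A - Z_2 - \theta G_A)$
        Set $A = \frac{1}{q} \sum_{k=1}^{q} Z_k$
        Set $W = \mathrm{prox}_{\theta \kappa \|\cdot\|_1}(W - \theta G_W)$
    until convergence
    return $(A, W)$ minimizing $\mathcal{L}$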
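The following NumPy sketch shows one way the iteration of Algorithm 1 could be organised. Several points are assumptions on our part rather than a transcription of the algorithm: the auxiliary variables use the relaxed update `Z_k <- Z_k + prox(...) - A` of the generalized forward-backward scheme (relaxation parameter 1), the loss is normalised by `1/(d*T)` and `1/d`, the feature map is stored as a hypothetical tensor `Omega` of shape `(d, n, n)`, the positivity constraint on `A` is ignored, and the step size uses a crude upper bound on the Lipschitz constant of the gradient.

```python
import numpy as np

def prox_trace(Z, t):
    U, s, Vt = np.linalg.svd(Z, full_matrices=False)
    return (U * np.maximum(s - t, 0.0)) @ Vt

def prox_l1(Z, t):
    return np.sign(Z) * np.maximum(np.abs(Z) - t, 0.0)

def smooth_loss_and_grad(A, W, X_prev, X_next, Omega, a_T):
    """Quadratic loss Phi(A, W) and its gradients with respect to A and W."""
    T, d = X_prev.shape
    R1 = X_prev @ W - X_next                              # VAR residual, shape (T, d)
    r2 = np.einsum("jkl,kl->j", Omega, A) - W.T @ a_T     # omega(A) - W^T omega(A_T)
    loss = (R1 ** 2).sum() / (d * T) + (r2 ** 2).sum() / d
    G_A = (2.0 / d) * np.einsum("j,jkl->kl", r2, Omega)   # adjoint of omega applied to r2
    G_W = (2.0 / (d * T)) * X_prev.T @ R1 - (2.0 / d) * np.outer(a_T, r2)
    return loss, G_A, G_W

def objective(A, W, X_prev, X_next, Omega, a_T, tau, gamma, kappa):
    loss, _, _ = smooth_loss_and_grad(A, W, X_prev, X_next, Omega, a_T)
    trace_norm = np.linalg.svd(A, compute_uv=False).sum()
    return loss + tau * trace_norm + gamma * np.abs(A).sum() + kappa * np.abs(W).sum()

def generalized_forward_backward(X_prev, X_next, Omega, a_T, tau, gamma, kappa, n_iter=300):
    T, d = X_prev.shape
    n = Omega.shape[1]
    # Upper bound on the Lipschitz constant of the gradient of Phi.
    op_omega = np.linalg.norm(Omega.reshape(d, -1), 2)
    L = (2.0 / d) * (np.linalg.norm(X_prev, 2) ** 2 / T + op_omega ** 2 + a_T @ a_T)
    theta = 1.0 / L                                       # below the 2 / L threshold
    q = 2
    A = np.zeros((n, n))
    Z1, Z2 = A.copy(), A.copy()
    W = np.zeros((d, d))
    for _ in range(n_iter):
        _, G_A, G_W = smooth_loss_and_grad(A, W, X_prev, X_next, Omega, a_T)
        Z1 = Z1 + prox_trace(2 * A - Z1 - theta * G_A, q * theta * tau) - A
        Z2 = Z2 + prox_l1(2 * A - Z2 - theta * G_A, q * theta * gamma) - A
        W = prox_l1(W - theta * G_W, theta * kappa)
        A = (Z1 + Z2) / q
    return A, W
```

A typical call passes the two design matrices, the feature tensor and the features of the last observed graph, for instance `generalized_forward_backward(X_prev, X_next, Omega, X_next[-1], 0.01, 0.01, 0.01)`. Monitoring `objective` along the iterations is a simple way to check that the chosen step size behaves as expected.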



4.2 A generative model for graphs having linearly autoregressive features 

Let $V_0 \in \mathbb{R}^{n \times r}$ be a sparse matrix, $V_0^\dagger$ its pseudo-inverse such that $V_0^\dagger V_0 = V_0^\top V_0^{\top \dagger} = I_r$. Fix two
sparse matrices $W_0 \in \mathbb{R}^{r \times r}$ and $U_0 \in \mathbb{R}^{n \times r}$. Now define the sequence of matrices $(A_t)_{t \ge 0}$ for
$t = 1, 2, \ldots$ by

$$U_t = U_{t-1} W_0 + N_t$$

and

$$A_t = U_t V_0^\top + M_t$$

for i.i.d. sparse noise matrices $N_t$ and $M_t$, which means that for any pair of indices $(i, j)$, with high
probability $(N_t)_{i,j} = 0$ and $(M_t)_{i,j} = 0$. We define the linear feature map $\omega(A) = A V_0^{\top \dagger}$, and
point out that

1. The sequence $\big(\omega(A_t)^\top\big)_t = \big(U_t + M_t V_0^{\top \dagger}\big)_t$ follows the linear autoregressive relation

$$\omega(A_t)^\top = \omega(A_{t-1})^\top W_0 + N_t + M_t V_0^{\top \dagger}.$$

2. For any time index $t$, the matrix $A_t$ is close to $U_t V_0^\top$, which has rank at most $r$.

3. The matrices $A_t$ and $U_t$ are both sparse by construction.
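The generative model above can be simulated directly. In the sketch below the function names, the sparsity level `density`, the threshold applied to the Gaussian noise and the rescaling of $W_0$ to operator norm 0.9 (to keep the factors from blowing up) are all our own assumptions, chosen only to obtain a runnable illustration.

```python
import numpy as np

def soft_threshold(Z, t):
    return np.sign(Z) * np.maximum(np.abs(Z) - t, 0.0)

def simulate_graph_sequence(n=50, r=5, T=10, sigma=0.5, threshold=1.0,
                            density=0.3, seed=0):
    """Draw A_0, ..., A_T from U_t = U_{t-1} W_0 + N_t and A_t = U_t V_0^T + M_t."""
    rng = np.random.default_rng(seed)

    def sparse_gaussian(shape):
        # Gaussian entries kept with probability `density` (assumed sparsity model).
        return rng.normal(size=shape) * (rng.random(shape) < density)

    def sparse_noise(shape):
        # Sparse noise obtained by soft-thresholding i.i.d. Gaussian noise.
        return soft_threshold(sigma * rng.normal(size=shape), threshold * sigma)

    V0 = sparse_gaussian((n, r))
    W0 = sparse_gaussian((r, r))
    W0 *= 0.9 / max(np.linalg.norm(W0, 2), 1e-12)   # assumed stabilisation
    U = sparse_gaussian((n, r))
    graphs, factors = [], []
    for _ in range(T + 1):
        U = U @ W0 + sparse_noise((n, r))
        graphs.append(U @ V0.T + sparse_noise((n, n)))
        factors.append(U)
    return graphs, factors, V0, W0

def feature_map(A, V0):
    """omega(A) = A (V_0^T)^+, which recovers U_t when the noise M_t vanishes."""
    return A @ np.linalg.pinv(V0).T

# Noise-free check of the two structural properties of the model.
graphs, factors, V0, W0 = simulate_graph_sequence(sigma=0.0)
```

With `sigma=0.0` the feature map returns the latent factors exactly and every snapshot has rank at most `r`; with positive noise these properties only hold approximately.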

4.3 Empirical evaluation 

We tested the presented methods on synthetic data generated as in Section 4.2. In our experiments
the noise matrices $M_t$ and $N_t$ were built by soft-thresholding i.i.d. noise $\mathcal{N}(0, \sigma^2)$. We took as
input $T = 10$ successive graph snapshots on $n = 50$ nodes graphs of rank $r = 5$. We used $d = 10$
linear features, and finally the noise level was set to $\sigma = 0.5$. We compare our methods to standard
baselines in link prediction. We use the area under the ROC curve as the measure of performance
and report empirical results averaged over 50 runs with the corresponding confidence intervals in
Figure 1. The competitor methods are the nearest neighbors (NN) and static sparse and low-rank
estimation, that is the link prediction algorithm suggested in [20]. The algorithm NN scores pairs
of nodes with the number of common friends between them, which is given by $A^2$ when $A$ is the
cumulative graph adjacency matrix $\widetilde{A}_T = \sum_{t=0}^{T} A_t$, and the static sparse and low-rank estimation
is obtained by minimizing the objective $\|X - \widetilde{A}_T\|_F^2 + \tau \|X\|_* + \gamma \|X\|_1$, and can be seen as the
closest static version of our method. The two methods autoregressive low-rank and static low-rank
are regularized using only the trace-norm (corresponding to forcing $\gamma = 0$) and are slightly inferior
to their sparse and low-rank rivals. Since the matrix $V_0$ defining the linear map $\omega$ is unknown, we
consider the feature map $\omega(A) = AV$ where $\widetilde{A}_T = U \Sigma V^\top$ is the SVD of $\widetilde{A}_T$. The parameters $\tau$
and $\gamma$ are chosen by 10-fold cross validation for each of the methods separately.

Figure 1: Left: performance of algorithms in terms of Area Under the ROC Curve, average and
confidence intervals over 50 runs (panel title: "Link prediction performance"). Right: Phase transition
diagram (axis label: rank $A_{T+1}$).
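The ingredients of this evaluation (cumulative adjacency matrix, common-neighbour scores, SVD-based feature map, area under the ROC curve) can be sketched as follows. All function names are ours, and the AUC is computed by direct pairwise comparison, which is adequate for small examples but is not meant as the evaluation code used for Figure 1.

```python
import numpy as np

def cumulative_adjacency(graphs):
    """Sum of the observed snapshots A_0 + ... + A_T."""
    return np.sum(graphs, axis=0)

def nearest_neighbor_scores(A_cum):
    """NN baseline: entry (i, j) of A^2 counts the common neighbours of i and j."""
    return A_cum @ A_cum

def svd_feature_map(A_cum, r):
    """Feature map A -> A V built from the top-r right singular vectors of A_cum."""
    _, _, Vt = np.linalg.svd(A_cum)
    V = Vt[:r].T
    return lambda A: A @ V

def auc(scores, labels):
    """Area under the ROC curve as the fraction of correctly ordered (pos, neg) pairs."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels).astype(bool)
    diff = scores[labels][:, None] - scores[~labels][None, :]
    return ((diff > 0).sum() + 0.5 * (diff == 0).sum()) / diff.size

# Path graph 0 - 1 - 2: nodes 0 and 2 share exactly one neighbour.
path = np.array([[0.0, 1.0, 0.0],
                 [1.0, 0.0, 1.0],
                 [0.0, 1.0, 0.0]])
A_cum = cumulative_adjacency([path, path])
nn = nearest_neighbor_scores(path)
omega = svd_feature_map(A_cum, r=2)
toy_auc = auc([0.9, 0.1, 0.4, 0.5], [1, 0, 1, 0])
```

In the toy AUC example three of the four (positive, negative) pairs are ranked correctly, giving 0.75.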



4.4 Discussion 

1. Comparison with the baselines. This experiment sharply shows the benefit of using a
temporal approach when one can handle the feature extraction task. The left-hand plot shows
that if few snapshots are available ($T < 4$ in these experiments), then static approaches are
to be preferred, whereas feature autoregressive approaches outperform them as soon as a sufficient
number $T$ of graph snapshots is available (see phase transition). The decreasing performance
of static algorithms can be explained by the fact that they use as input a mixture of graphs
observed at different time steps. Knowing that at each time step the nodes have specific
latent factors, despite the slow evolution of the factors, adding the resulting graphs leads to
confusing the factors.

2. Phase transition. The right-hand figure is a phase transition diagram showing in which part
of the rank and time domain the estimation is accurate, and illustrates the interplay between
these two domain parameters.

3. Choice of the feature map $\omega$. In the current work we used the projection onto the vector
space of the top-$r$ singular vectors of the cumulative adjacency matrix as the linear map $\omega$,
and this choice has shown empirical superiority to other choices. The question of choosing
the best measurement to summarize graph information as in compressed sensing seems to
have both theoretical and application potential. Moreover, a deeper understanding of the
connections of our problem with compressed sensing, for the construction and theoretical
validation of the feature mapping, is an important point that needs several developments.
One possible approach is based on multi-kernel learning, which should be considered in
future work.

4. Generalization of the method. In this paper we consider only an autoregressive process of
order 1. For better prediction accuracy, one could consider more general models, such as
vector ARMA models, and use model-selection techniques for the choice of the orders of
the model. A general modelling based on state-space models could be developed as well.
We presented a procedure for predicting graphs having linear autoregressive features. Our
approach can easily be generalized to non-linear prediction through kernel-based methods.

[Appendix : Proof of propositions] 



A Proofs of the main results 

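From now on, we use the notation $\|(A, a)\|_F^2 = \|A\|_F^2 + \|a\|_2^2$ and $\langle (A, a), (B, b) \rangle = \langle A, B \rangle + \langle a, b \rangle$
for any $A, B \in \mathbb{R}^{T \times d}$ and $a, b \in \mathbb{R}^d$.

Let us introduce the linear mapping $\Phi : \mathbb{R}^{n \times n} \times \mathbb{R}^{d \times d} \to \mathbb{R}^{T \times d} \times \mathbb{R}^d$ given by

$$\Phi(A, W) = \Big( \frac{1}{\sqrt{T}} X_{T-1} W, \ \omega(A) - W^\top \omega(A_T) \Big).$$

Using this mapping, the objective (3) can be written in the following reduced way:

$$\mathcal{L}(A, W) = \frac{1}{d} \Big\| \Big( \frac{1}{\sqrt{T}} X_T, 0 \Big) - \Phi(A, W) \Big\|_F^2 + \gamma \|A\|_1 + \tau \|A\|_* + \kappa \|W\|_1.$$

Recalling that the error writes, for any $A$ and $W$:

$$\mathcal{E}(A, W)^2 = \frac{1}{d} \big\|(W - W_0)^\top \omega(A_T) - \omega(A - A_{T+1})\big\|_F^2 + \frac{1}{dT} \big\|X_{T-1}(W - W_0)\big\|_F^2,$$

we have

$$\mathcal{E}(A, W)^2 = \frac{1}{d} \big\|\Phi(A - A_{T+1}, W - W_0)\big\|_F^2.$$

Let us introduce also the empirical risk

$$R_n(A, W) = \frac{1}{d} \Big\| \Big( \frac{1}{\sqrt{T}} X_T, 0 \Big) - \Phi(A, W) \Big\|_F^2.$$

The proofs of Theorems 1 and 2 are based on tools developed in [12] and [3]. However, the context
considered here is very different from the setting considered in these papers, so our proofs require a
different scheme.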



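A.1 Proof of Theorem 1

First, note that

$$R_n(\hat{A}, \hat{W}) - R_n(A, W) = \frac{1}{d} \Big( \|\Phi(\hat{A}, \hat{W})\|_F^2 - \|\Phi(A, W)\|_F^2 - 2 \Big\langle \Big( \frac{1}{\sqrt{T}} X_T, 0 \Big), \Phi(\hat{A} - A, \hat{W} - W) \Big\rangle \Big).$$

Since

$$\frac{1}{d} \Big( \|\Phi(\hat{A}, \hat{W})\|_F^2 - \|\Phi(A, W)\|_F^2 \Big) = \mathcal{E}(\hat{A}, \hat{W})^2 - \mathcal{E}(A, W)^2 + \frac{2}{d} \big\langle \Phi(\hat{A} - A, \hat{W} - W), \Phi(A_{T+1}, W_0) \big\rangle,$$

we have

$$R_n(\hat{A}, \hat{W}) - R_n(A, W) = \mathcal{E}(\hat{A}, \hat{W})^2 - \mathcal{E}(A, W)^2 + \frac{2}{d} \Big\langle \Phi(\hat{A} - A, \hat{W} - W), \Phi(A_{T+1}, W_0) - \Big( \frac{1}{\sqrt{T}} X_T, 0 \Big) \Big\rangle$$
$$= \mathcal{E}(\hat{A}, \hat{W})^2 - \mathcal{E}(A, W)^2 + \frac{2}{d} \Big\langle \Phi(\hat{A} - A, \hat{W} - W), \Big( -\frac{1}{\sqrt{T}} N_T, N_{T+1} \Big) \Big\rangle.$$

The next Lemma will come in handy several times in the proofs.

Lemma 1. For any $A \in \mathbb{R}^{n \times n}$ and $W \in \mathbb{R}^{d \times d}$ we have

$$\Big\langle \Big( \frac{1}{\sqrt{T}} N_T, -N_{T+1} \Big), \Phi(A, W) \Big\rangle = \Big\langle \Big( M, \frac{1}{T} \Xi \Big), (A, W) \Big\rangle = \frac{1}{T} \langle W, \Xi \rangle + \langle A, M \rangle.$$

This Lemma follows from a direct computation, and the proof is thus omitted. This Lemma entails,
together with (4), that

$$\mathcal{E}(\hat{A}, \hat{W})^2 \le \mathcal{E}(A, W)^2 + \frac{2}{dT} \langle \hat{W} - W, \Xi \rangle + \frac{2}{d} \langle \hat{A} - A, M \rangle$$
$$\qquad + \tau \big( \|A\|_* - \|\hat{A}\|_* \big) + \gamma \big( \|A\|_1 - \|\hat{A}\|_1 \big) + \kappa \big( \|W\|_1 - \|\hat{W}\|_1 \big).$$

Now, using Hölder's inequality and the triangle inequality, and introducing $\alpha \in (0, 1)$, we obtain

$$\mathcal{E}(\hat{A}, \hat{W})^2 \le \mathcal{E}(A, W)^2 + \Big( \frac{2\alpha}{d} \|M\|_{\mathrm{op}} - \tau \Big) \|\hat{A}\|_* + \Big( \frac{2\alpha}{d} \|M\|_{\mathrm{op}} + \tau \Big) \|A\|_*$$
$$\qquad + \Big( \frac{2(1 - \alpha)}{d} \|M\|_\infty - \gamma \Big) \|\hat{A}\|_1 + \Big( \frac{2(1 - \alpha)}{d} \|M\|_\infty + \gamma \Big) \|A\|_1$$
$$\qquad + \Big( \frac{2}{dT} \|\Xi\|_\infty - \kappa \Big) \|\hat{W}\|_1 + \Big( \frac{2}{dT} \|\Xi\|_\infty + \kappa \Big) \|W\|_1,$$

which concludes the proof of Theorem 1, using (6). □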



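A.2 Proof of Theorem 2

Let $A \in \mathbb{R}^{n \times n}$ and $W \in \mathbb{R}^{d \times d}$ be fixed, and let $A = U \mathrm{diag}(\sigma_1, \ldots, \sigma_r) V^\top$ be the SVD of $A$.
Recalling that $\circ$ is the entry-wise product, we have $A = \Theta_A \circ |A| + \Theta_A^\perp \circ A$, where $\Theta_A \in \{0, \pm 1\}^{n \times n}$
is the entry-wise sign matrix of $A$ and $\Theta_A^\perp \in \{0, 1\}^{n \times n}$ is the orthogonal sparsity pattern of $A$.

The definition (4) of $(\hat{A}, \hat{W})$ is equivalent to the fact that one can find $\hat{G} \in \partial \mathcal{L}(\hat{A}, \hat{W})$ (an element
of the subgradient of $\mathcal{L}$ at $(\hat{A}, \hat{W})$) that belongs to the normal cone of $\mathcal{A} \times \mathcal{W}$ at $(\hat{A}, \hat{W})$. This means
that for such a $\hat{G}$, and any $A \in \mathcal{A}$ and $W \in \mathcal{W}$, we have

$$\langle \hat{G}, (\hat{A} - A, \hat{W} - W) \rangle \le 0. \tag{11}$$

Any subgradient of the function $g(A) = \tau \|A\|_* + \gamma \|A\|_1$ writes

$$Z = \tau Z_* + \gamma Z_1 = \tau \big( U V^\top + \mathcal{P}_A^\perp(G_*) \big) + \gamma \big( \Theta_A + G_1 \circ \Theta_A^\perp \big)$$

for some $\|G_*\|_{\mathrm{op}} \le 1$ and $\|G_1\|_\infty \le 1$ (see for instance [15]). So, if $\hat{Z} \in \partial g(\hat{A})$, we have, by
monotonicity of the sub-differential, that for any $Z \in \partial g(A)$

$$\langle \hat{Z}, \hat{A} - A \rangle = \langle \hat{Z} - Z, \hat{A} - A \rangle + \langle Z, \hat{A} - A \rangle \ge \langle Z, \hat{A} - A \rangle,$$

and, by duality, we can find $Z$ such that

$$\langle Z, \hat{A} - A \rangle = \tau \langle U V^\top, \hat{A} - A \rangle + \tau \|\mathcal{P}_A^\perp(\hat{A})\|_* + \gamma \langle \Theta_A, \hat{A} - A \rangle + \gamma \|\Theta_A^\perp \circ \hat{A}\|_1.$$

By using the same argument with the function $W \mapsto \|W\|_1$ and by computing the gradient of the
empirical risk $(A, W) \mapsto R_n(A, W)$, Equation (11) entails that

$$\frac{2}{d} \big\langle \Phi(\hat{A} - A_{T+1}, \hat{W} - W_0), \Phi(\hat{A} - A, \hat{W} - W) \big\rangle$$
$$\le \frac{2}{d} \Big\langle \Big( \frac{1}{\sqrt{T}} N_T, -N_{T+1} \Big), \Phi(\hat{A} - A, \hat{W} - W) \Big\rangle - \tau \langle U V^\top, \hat{A} - A \rangle - \tau \|\mathcal{P}_A^\perp(\hat{A})\|_*$$
$$\quad - \gamma \langle \Theta_A, \hat{A} - A \rangle - \gamma \|\Theta_A^\perp \circ \hat{A}\|_1 - \kappa \langle \Theta_W, \hat{W} - W \rangle - \kappa \|\Theta_W^\perp \circ \hat{W}\|_1. \tag{12}$$

Using Pythagoras' theorem, we have

$$2 \big\langle \Phi(\hat{A} - A_{T+1}, \hat{W} - W_0), \Phi(\hat{A} - A, \hat{W} - W) \big\rangle$$
$$= \|\Phi(\hat{A} - A_{T+1}, \hat{W} - W_0)\|_2^2 + \|\Phi(\hat{A} - A, \hat{W} - W)\|_2^2 - \|\Phi(A - A_{T+1}, W - W_0)\|_2^2. \tag{13}$$

It shows that if $\langle \Phi(\hat{A} - A_{T+1}, \hat{W} - W_0), \Phi(\hat{A} - A, \hat{W} - W) \rangle \le 0$, then Theorem 2 trivially holds.
Let us assume that

$$\big\langle \Phi(\hat{A} - A_{T+1}, \hat{W} - W_0), \Phi(\hat{A} - A, \hat{W} - W) \big\rangle > 0. \tag{14}$$

Using Hölder's inequality, we obtain

$$|\langle U V^\top, \hat{A} - A \rangle| = |\langle U V^\top, \mathcal{P}_A(\hat{A} - A) \rangle| \le \|U V^\top\|_{\mathrm{op}} \|\mathcal{P}_A(\hat{A} - A)\|_* = \|\mathcal{P}_A(\hat{A} - A)\|_*,$$

$$|\langle \Theta_A, \hat{A} - A \rangle| = |\langle \Theta_A, \Theta_A \circ (\hat{A} - A) \rangle| \le \|\Theta_A\|_\infty \|\Theta_A \circ (\hat{A} - A)\|_1 = \|\Theta_A \circ (\hat{A} - A)\|_1,$$

and the same is done for $|\langle \Theta_W, \hat{W} - W \rangle| \le \|\Theta_W \circ (\hat{W} - W)\|_1$. So, when (14) holds, we obtain
by rearranging the terms of (12):

$$\tau \|\mathcal{P}_A^\perp(\hat{A} - A)\|_* + \gamma \|\Theta_A^\perp \circ (\hat{A} - A)\|_1 + \kappa \|\Theta_W^\perp \circ (\hat{W} - W)\|_1$$
$$\le \tau \|\mathcal{P}_A(\hat{A} - A)\|_* + \gamma \|\Theta_A \circ (\hat{A} - A)\|_1 + \kappa \|\Theta_W \circ (\hat{W} - W)\|_1$$
$$\quad + \frac{2}{d} \Big\langle \Big( \frac{1}{\sqrt{T}} N_T, -N_{T+1} \Big), \Phi(\hat{A} - A, \hat{W} - W) \Big\rangle. \tag{15}$$

Using Lemma 1 together with Hölder's inequality, we have for any $\alpha \in (0, 1)$:

$$\Big\langle \Big( \frac{1}{\sqrt{T}} N_T, -N_{T+1} \Big), \Phi(\hat{A} - A, \hat{W} - W) \Big\rangle = \langle M, \hat{A} - A \rangle + \frac{1}{T} \langle \Xi, \hat{W} - W \rangle$$
$$\le \alpha \|M\|_{\mathrm{op}} \|\mathcal{P}_A(\hat{A} - A)\|_* + \alpha \|M\|_{\mathrm{op}} \|\mathcal{P}_A^\perp(\hat{A} - A)\|_*$$
$$\quad + (1 - \alpha) \|M\|_\infty \|\Theta_A \circ (\hat{A} - A)\|_1 + (1 - \alpha) \|M\|_\infty \|\Theta_A^\perp \circ (\hat{A} - A)\|_1$$
$$\quad + \frac{1}{T} \|\Xi\|_\infty \big( \|\Theta_W \circ (\hat{W} - W)\|_1 + \|\Theta_W^\perp \circ (\hat{W} - W)\|_1 \big). \tag{16}$$

Now, using (15) together with (16), we obtain

$$\Big( \tau - \frac{2\alpha}{d} \|M\|_{\mathrm{op}} \Big) \|\mathcal{P}_A^\perp(\hat{A} - A)\|_* + \Big( \gamma - \frac{2(1 - \alpha)}{d} \|M\|_\infty \Big) \|\Theta_A^\perp \circ (\hat{A} - A)\|_1 + \Big( \kappa - \frac{2}{dT} \|\Xi\|_\infty \Big) \|\Theta_W^\perp \circ (\hat{W} - W)\|_1$$
$$\le \Big( \tau + \frac{2\alpha}{d} \|M\|_{\mathrm{op}} \Big) \|\mathcal{P}_A(\hat{A} - A)\|_* + \Big( \gamma + \frac{2(1 - \alpha)}{d} \|M\|_\infty \Big) \|\Theta_A \circ (\hat{A} - A)\|_1 + \Big( \kappa + \frac{2}{dT} \|\Xi\|_\infty \Big) \|\Theta_W \circ (\hat{W} - W)\|_1,$$

which proves, using (7), that

$$\tau \|\mathcal{P}_A^\perp(\hat{A} - A)\|_* + \gamma \|\Theta_A^\perp \circ (\hat{A} - A)\|_1 \le 5 \tau \|\mathcal{P}_A(\hat{A} - A)\|_* + 5 \gamma \|\Theta_A \circ (\hat{A} - A)\|_1.$$

This proves that $\hat{A} - A \in \mathcal{C}_2(A, 5, \gamma / \tau)$. In the same way, using (15) with $A = \hat{A}$ together with (16),
we obtain that $\hat{W} - W \in \mathcal{C}_1(W, 5)$.

Now, using together (12), (13) and (16), and the fact that the Cauchy-Schwarz inequality entails

$$\|\mathcal{P}_A(\hat{A} - A)\|_* \le \sqrt{\mathrm{rank} A} \, \|\mathcal{P}_A(\hat{A} - A)\|_F, \quad |\langle U V^\top, \hat{A} - A \rangle| \le \sqrt{\mathrm{rank} A} \, \|\mathcal{P}_A(\hat{A} - A)\|_F,$$

$$\|\Theta_A \circ (\hat{A} - A)\|_1 \le \sqrt{\|A\|_0} \, \|\Theta_A \circ (\hat{A} - A)\|_F, \quad |\langle \Theta_A, \hat{A} - A \rangle| \le \sqrt{\|A\|_0} \, \|\Theta_A \circ (\hat{A} - A)\|_F,$$

and similarly for $\hat{W} - W$, we arrive at

$$\frac{1}{d} \Big( \|\Phi(\hat{A} - A_{T+1}, \hat{W} - W_0)\|_2^2 + \|\Phi(\hat{A} - A, \hat{W} - W)\|_2^2 - \|\Phi(A - A_{T+1}, W - W_0)\|_2^2 \Big)$$
$$\le \Big( \frac{2\alpha}{d} \|M\|_{\mathrm{op}} + \tau \Big) \sqrt{\mathrm{rank} A} \, \|\mathcal{P}_A(\hat{A} - A)\|_F + \Big( \frac{2\alpha}{d} \|M\|_{\mathrm{op}} - \tau \Big) \|\mathcal{P}_A^\perp(\hat{A} - A)\|_*$$
$$\quad + \Big( \frac{2(1 - \alpha)}{d} \|M\|_\infty + \gamma \Big) \sqrt{\|A\|_0} \, \|\Theta_A \circ (\hat{A} - A)\|_F + \Big( \frac{2(1 - \alpha)}{d} \|M\|_\infty - \gamma \Big) \|\Theta_A^\perp \circ (\hat{A} - A)\|_1$$
$$\quad + \Big( \frac{2}{dT} \|\Xi\|_\infty + \kappa \Big) \sqrt{\|W\|_0} \, \|\Theta_W \circ (\hat{W} - W)\|_F + \Big( \frac{2}{dT} \|\Xi\|_\infty - \kappa \Big) \|\Theta_W^\perp \circ (\hat{W} - W)\|_1,$$

which leads, using (7), to

$$\frac{1}{d} \|\Phi(\hat{A} - A_{T+1}, \hat{W} - W_0)\|_2^2 + \frac{1}{d} \|\Phi(\hat{A} - A, \hat{W} - W)\|_2^2 - \frac{1}{d} \|\Phi(A - A_{T+1}, W - W_0)\|_2^2$$
$$\le \frac{5\tau}{3} \sqrt{\mathrm{rank} A} \, \|\mathcal{P}_A(\hat{A} - A)\|_F + \frac{5\gamma}{3} \sqrt{\|A\|_0} \, \|\Theta_A \circ (\hat{A} - A)\|_F + \frac{5\kappa}{3} \sqrt{\|W\|_0} \, \|\Theta_W \circ (\hat{W} - W)\|_F.$$

Since $\hat{A} - A \in \mathcal{C}_2(A, 5, \gamma / \tau)$ and $\hat{W} - W \in \mathcal{C}_1(W, 5)$, we obtain using the RE assumption
(Definition 1) and $ab \le (a^2 + b^2) / 2$:

$$\frac{1}{d} \|\Phi(\hat{A} - A_{T+1}, \hat{W} - W_0)\|_2^2 + \frac{1}{d} \|\Phi(\hat{A} - A, \hat{W} - W)\|_2^2$$
$$\le \frac{1}{d} \|\Phi(A - A_{T+1}, W - W_0)\|_2^2 + \frac{25}{18} \mu_2(A, W)^2 \big( \mathrm{rank}(A) \tau^2 + \|A\|_0 \gamma^2 \big)$$
$$\quad + \frac{25}{36} \mu_1(W)^2 \|W\|_0 \kappa^2 + \frac{1}{d} \|\Phi(\hat{A} - A, \hat{W} - W)\|_2^2,$$

which concludes the proof of Theorem 2. □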
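A.3 Proof of Theorem 4

For the proof of (8), we simply use the fact that $\frac{1}{dT} \|X_{T-1}(\hat{W} - W_0)\|_F^2 \le \mathcal{E}(\hat{A}, \hat{W})^2$ and use
Theorem 3. Then we take $W = W_0$ in the infimum over $A, W$.

For (9), we use the fact that since $\hat{W} - W_0 \in \mathcal{C}_1(W_0, 5)$, we have (see the proof of Theorem 2),

$$\|\hat{W} - W_0\|_1 \le 6 \sqrt{\|W_0\|_0} \, \|\Theta_{W_0} \circ (\hat{W} - W_0)\|_F$$
$$\le 6 \sqrt{\|W_0\|_0} \, \mu_1(W_0) \, \|X_{T-1}(\hat{W} - W_0)\|_F / \sqrt{dT}$$
$$\le 6 \sqrt{\|W_0\|_0} \, \mu_1(W_0) \, \mathcal{E}(\hat{A}, \hat{W}),$$

and then use again Theorem 3. The proof of (10) follows exactly the same scheme. □

A.4 Concentration inequalities for the noise processes

The control of the noise terms $M$ and $\Xi$ is based on recent developments on concentration
inequalities for random matrices, see for instance [24]. Moreover, the assumption on the dynamics of
the features' noise vector $(N_t)_{t \ge 0}$ is quite general, since we only assumed that this process is a
martingale increment. Therefore, our control of the noise $\Xi$ relies in particular on martingale theory.

Proposition 1. Under Assumption 1, the following inequalities hold for any $x > 0$. We have

$$\Big\| \frac{1}{d} \sum_{j=1}^{d} (N_{T+1})_j \Omega_j \Big\|_{\mathrm{op}} \le \sigma v_{\Omega, \mathrm{op}} \sqrt{\frac{2(x + \log(2n))}{d}} \tag{17}$$

with a probability larger than $1 - e^{-x}$. We have

$$\Big\| \frac{1}{d} \sum_{j=1}^{d} (N_{T+1})_j \Omega_j \Big\|_\infty \le \sigma v_{\Omega, \infty} \sqrt{\frac{2(x + 2 \log n)}{d}} \tag{18}$$

with a probability larger than $1 - 2e^{-x}$, and finally

$$\Big\| \frac{1}{T + 1} \sum_{t=1}^{T+1} \omega(A_{t-1}) N_t^\top \Big\|_\infty \le \sigma \sigma_\omega \sqrt{\frac{2e(x + 2 \log d + \ell_T)}{T + 1}} \tag{19}$$

with a probability larger than $1 - 14e^{-x}$, where

$$\ell_T = 2 \max_{j = 1, \ldots, d} \log \log \Big( \frac{\sum_{t=1}^{T+1} \omega_j(A_{t-1})^2}{T + 1} \vee \frac{T + 1}{\sum_{t=1}^{T+1} \omega_j(A_{t-1})^2} \vee e \Big).$$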

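Proof. For the proofs of Inequalities (17) and (18), we use the fact that $(N_{T+1})_1, \ldots, (N_{T+1})_d$ are
independent (scalar) subgaussian random variables.

From Assumption 1 we have for any $n \times n$ deterministic self-adjoint matrices $X_j$ that
$\mathbb{E}[\exp(\lambda (N_{T+1})_j X_j)] \preceq \exp(\sigma^2 \lambda^2 X_j^2 / 2)$, where $\preceq$ stands for the semidefinite order on self-adjoint
matrices. Using Corollary 3.7 from [24], this leads for any $x > 0$ to

$$\mathbb{P}\Big[ \lambda_{\max}\Big( \sum_{j=1}^{d} (N_{T+1})_j X_j \Big) \ge x \Big] \le n \exp\Big( -\frac{x^2}{2 u^2} \Big), \quad \text{where } u^2 = \sigma^2 \Big\| \sum_{j=1}^{d} X_j^2 \Big\|_{\mathrm{op}}. \tag{20}$$

Then, following [24], we consider the dilation operator $\mathcal{D} : \mathbb{R}^{n \times n} \to \mathbb{R}^{2n \times 2n}$ given by

$$\mathcal{D}(\Omega) = \begin{pmatrix} 0 & \Omega \\ \Omega^\top & 0 \end{pmatrix}.$$

We have

$$\Big\| \sum_{j=1}^{d} (N_{T+1})_j \Omega_j \Big\|_{\mathrm{op}} = \lambda_{\max}\Big( \mathcal{D}\Big( \sum_{j=1}^{d} (N_{T+1})_j \Omega_j \Big) \Big) = \lambda_{\max}\Big( \sum_{j=1}^{d} (N_{T+1})_j \mathcal{D}(\Omega_j) \Big)$$

and an easy computation gives

$$\Big\| \sum_{j=1}^{d} \mathcal{D}(\Omega_j)^2 \Big\|_{\mathrm{op}} = \Big\| \sum_{j=1}^{d} \Omega_j^\top \Omega_j \Big\|_{\mathrm{op}} \vee \Big\| \sum_{j=1}^{d} \Omega_j \Omega_j^\top \Big\|_{\mathrm{op}}.$$

So, using (20) with the self-adjoint $X_j = \mathcal{D}(\Omega_j)$ gives

$$\mathbb{P}\Big[ \Big\| \sum_{j=1}^{d} (N_{T+1})_j \Omega_j \Big\|_{\mathrm{op}} \ge x \Big] \le 2n \exp\Big( -\frac{x^2}{2 u^2} \Big), \quad \text{where } u^2 = \sigma^2 \Big( \Big\| \sum_{j=1}^{d} \Omega_j^\top \Omega_j \Big\|_{\mathrm{op}} \vee \Big\| \sum_{j=1}^{d} \Omega_j \Omega_j^\top \Big\|_{\mathrm{op}} \Big),$$

which leads easily to (17).

Inequality (18) comes from the following standard bound on the sum of independent sub-gaussian
random variables:

$$\mathbb{P}\Big[ \Big| \frac{1}{d} \sum_{j=1}^{d} (N_{T+1})_j (\Omega_j)_{k,l} \Big| \ge x \Big] \le 2 \exp\Big( -\frac{d^2 x^2}{2 \sigma^2 \sum_{j=1}^{d} (\Omega_j)_{k,l}^2} \Big),$$

together with a union bound on $1 \le k, l \le n$.

Inequality (19) is based on a classical martingale exponential argument together with a peeling
argument. We denote by $\omega_j(A_t)$ the coordinates of $\omega(A_t) \in \mathbb{R}^d$ and by $N_{t,k}$ those of $N_t$, so that

$$\Big( \sum_{t=1}^{T+1} \omega(A_{t-1}) N_t^\top \Big)_{j,k} = \sum_{t=1}^{T+1} \omega_j(A_{t-1}) N_{t,k}.$$

We fix $j, k$ and denote for short $\varepsilon_t = N_{t,k}$ and $x_t = \omega_j(A_t)$. Since $\mathbb{E}[\exp(\lambda \varepsilon_t) \mid \mathcal{F}_{t-1}] \le e^{\sigma^2 \lambda^2 / 2}$
for any $\lambda \in \mathbb{R}$, we obtain by a recursive conditioning with respect to $\mathcal{F}_{T-1}, \mathcal{F}_{T-2}, \ldots, \mathcal{F}_0$, that

$$\mathbb{E}\Big[ \exp\Big( \theta \sum_{t=1}^{T+1} \varepsilon_t x_{t-1} - \frac{\sigma^2 \theta^2}{2} \sum_{t=1}^{T+1} x_{t-1}^2 \Big) \Big] \le 1.$$

Hence, using Markov's inequality, we obtain for any $v > 0$:

$$\mathbb{P}\Big[ \sum_{t=1}^{T+1} \varepsilon_t x_{t-1} > x, \ \sum_{t=1}^{T+1} x_{t-1}^2 \le v \Big] \le \inf_{\theta > 0} \exp\big( -\theta x + \sigma^2 \theta^2 v / 2 \big) = \exp\Big( -\frac{x^2}{2 \sigma^2 v} \Big),$$

that we rewrite in the following way:

$$\mathbb{P}\Big[ \sum_{t=1}^{T+1} \varepsilon_t x_{t-1} > \sigma \sqrt{2 v x}, \ \sum_{t=1}^{T+1} x_{t-1}^2 \le v \Big] \le e^{-x}.$$

Let us denote for short $V_T = \sum_{t=1}^{T+1} x_{t-1}^2$ and $S_T = \sum_{t=1}^{T+1} \varepsilon_t x_{t-1}$. We want to replace $v$
by $V_T$ from the previous deviation inequality, and to remove the event $\{V_T \le v\}$. To do so,
we use a peeling argument. We take $v = T + 1$ and introduce $v_k = v e^k$ so that the event
$\{V_T > v\}$ is decomposed into the union of the disjoint sets $\{v_k < V_T \le v_{k+1}\}$. We introduce
also

$$\ell_T = 2 \log \log \Big( \frac{\sum_{t=1}^{T+1} x_{t-1}^2}{T + 1} \vee \frac{T + 1}{\sum_{t=1}^{T+1} x_{t-1}^2} \vee e \Big).$$

This leads to

$$\mathbb{P}\Big[ S_T > \sigma \sqrt{2 e V_T (x + \ell_T)}, \ V_T > v \Big] = \sum_{k \ge 0} \mathbb{P}\Big[ S_T > \sigma \sqrt{2 e V_T (x + \ell_T)}, \ v_k < V_T \le v_{k+1} \Big]$$
$$\le \sum_{k \ge 0} \mathbb{P}\Big[ S_T > \sigma \sqrt{2 v_{k+1} \big( x + 2 \log \log(e^k \vee e) \big)}, \ v_k < V_T \le v_{k+1} \Big]$$
$$\le e^{-x} \Big( 1 + \sum_{k \ge 1} k^{-2} \Big) \le 3.47 e^{-x}.$$

On $\{V_T \le v\}$ the proof is the same: we decompose onto the disjoint sets $\{v_{k+1} < V_T \le v_k\}$ where
this time $v_k = v e^{-k}$, and we arrive at

$$\mathbb{P}\Big[ S_T > \sigma \sqrt{2 e V_T (x + \ell_T)}, \ V_T \le v \Big] \le 3.47 e^{-x}.$$

This leads to

$$\mathbb{P}\Big[ \sum_{t=1}^{T+1} \omega_j(A_{t-1}) N_{t,k} \ge \sigma \Big( 2e \sum_{t=1}^{T+1} \omega_j(A_{t-1})^2 (x + \ell_{T,j}) \Big)^{1/2} \Big] \le 7 e^{-x}$$

for any $1 \le j, k \le d$, where we introduced

$$\ell_{T,j} = 2 \log \log \Big( \frac{\sum_{t=1}^{T+1} \omega_j(A_{t-1})^2}{T + 1} \vee \frac{T + 1}{\sum_{t=1}^{T+1} \omega_j(A_{t-1})^2} \vee e \Big).$$

The conclusion follows from a union bound on $1 \le j, k \le d$. This concludes the proof of
Proposition 1. □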